FIFA is a series of soccer video games developed and released annually by Electronic Arts (EA). While the game is extremely popular across the world, it has also come with its fair share of controversy. Much of this controversy arises from the fact that each player in the game is assigned an overall rating from 1 to 99. As a former avid FIFA player, I have always wondered how developers decide on player ratings. Now, I have begun to ask: “Can a machine do it too?” It is entirely possible that the developers at EA use a machine learning model to determine player ratings. However, that did not stop me from building my very own. This project is my attempt to create a machine learning model that can predict the overall FIFA rating of any player.
At first glance, a model developed for a specific video game may not seem to be useful in the real world. However, such a model may have relevance beyond the FIFA series and into the greater sporting world. Determining a player’s overall rating can provide insight into the traits and attributes that soccer analysts/professionals value and desire. These same traits and attributes can be further examined in young players, thereby improving a team’s ability to find young/new talent.
I have utilized a large publicly available Kaggle dataset (https://www.kaggle.com/datasets/stefanoleone992/fifa-20-complete-player-dataset) containing player data from FIFA 15 to FIFA 20. The dataset consists of 18,278 rows and 104 columns. Each row contains a unique player (observation) and each column contains unique information about the player such as their name, age, height, weight, nationality, and much more. Here is a brief preview:
#Load in data and clean names
fifa_data <- read.csv("~/Downloads/archive/players_20.csv", header=TRUE, sep = ",")
fifa_data <- fifa_data %>%
clean_names()
fifa_data %>%
head()
## sofifa_id
## 1 158023
## 2 20801
## 3 190871
## 4 200389
## 5 183277
## 6 192985
## player_url
## 1 https://sofifa.com/player/158023/lionel-messi/20/159586
## 2 https://sofifa.com/player/20801/c-ronaldo-dos-santos-aveiro/20/159586
## 3 https://sofifa.com/player/190871/neymar-da-silva-santos-jr/20/159586
## 4 https://sofifa.com/player/200389/jan-oblak/20/159586
## 5 https://sofifa.com/player/183277/eden-hazard/20/159586
## 6 https://sofifa.com/player/192985/kevin-de-bruyne/20/159586
## short_name long_name age dob
## 1 L. Messi Lionel Andrés Messi Cuccittini 32 1987-06-24
## 2 Cristiano Ronaldo Cristiano Ronaldo dos Santos Aveiro 34 1985-02-05
## 3 Neymar Jr Neymar da Silva Santos Junior 27 1992-02-05
## 4 J. Oblak Jan Oblak 26 1993-01-07
## 5 E. Hazard Eden Hazard 28 1991-01-07
## 6 K. De Bruyne Kevin De Bruyne 28 1991-06-28
## height_cm weight_kg nationality club overall potential
## 1 170 72 Argentina FC Barcelona 94 94
## 2 187 83 Portugal Juventus 93 93
## 3 175 68 Brazil Paris Saint-Germain 92 92
## 4 188 87 Slovenia Atlético Madrid 91 93
## 5 175 74 Belgium Real Madrid 91 91
## 6 181 70 Belgium Manchester City 91 91
## value_eur wage_eur player_positions preferred_foot international_reputation
## 1 95500000 565000 RW, CF, ST Left 5
## 2 58500000 405000 ST, LW Right 5
## 3 105500000 290000 LW, CAM Right 5
## 4 77500000 125000 GK Right 3
## 5 90000000 470000 LW, CF Right 4
## 6 90000000 370000 CAM, CM Right 4
## weak_foot skill_moves work_rate body_type real_face release_clause_eur
## 1 4 4 Medium/Low Messi Yes 195800000
## 2 4 5 High/Low C. Ronaldo Yes 96500000
## 3 5 5 High/Medium Neymar Yes 195200000
## 4 3 1 Medium/Medium Normal Yes 164700000
## 5 4 4 High/Medium Normal Yes 184500000
## 6 5 4 High/High Normal Yes 166500000
## player_tags
## 1 #Dribbler, #Distance Shooter, #Crosser, #FK Specialist, #Acrobat, #Clinical Finisher, #Complete Forward
## 2 #Speedster, #Dribbler, #Distance Shooter, #Acrobat, #Clinical Finisher, #Complete Forward
## 3 #Speedster, #Dribbler, #Playmaker , #Crosser, #FK Specialist, #Acrobat, #Clinical Finisher, #Complete Midfielder, #Complete Forward
## 4
## 5 #Speedster, #Dribbler, #Acrobat
## 6 #Dribbler, #Playmaker , #Engine, #Distance Shooter, #Crosser, #Complete Midfielder
## team_position team_jersey_number loaned_from joined contract_valid_until
## 1 RW 10 2004-07-01 2021
## 2 LW 7 2018-07-10 2022
## 3 CAM 10 2017-08-03 2022
## 4 GK 13 2014-07-16 2023
## 5 LW 7 2019-07-01 2024
## 6 RCM 17 2015-08-30 2023
## nation_position nation_jersey_number pace shooting passing dribbling
## 1 NA 87 92 92 96
## 2 LS 7 90 93 82 89
## 3 LW 10 91 85 87 95
## 4 GK 1 NA NA NA NA
## 5 LF 10 91 83 86 94
## 6 RCM 7 76 86 92 86
## defending physic gk_diving gk_handling gk_kicking gk_reflexes gk_speed
## 1 39 66 NA NA NA NA NA
## 2 35 78 NA NA NA NA NA
## 3 32 58 NA NA NA NA NA
## 4 NA NA 87 92 78 89 52
## 5 35 66 NA NA NA NA NA
## 6 61 78 NA NA NA NA NA
## gk_positioning
## 1 NA
## 2 NA
## 3 NA
## 4 90
## 5 NA
## 6 NA
## player_traits
## 1 Beat Offside Trap, Argues with Officials, Early Crosser, Finesse Shot, Speed Dribbler (CPU AI Only), 1-on-1 Rush, Giant Throw-in, Outside Foot Shot
## 2 Long Throw-in, Selfish, Argues with Officials, Early Crosser, Speed Dribbler (CPU AI Only), Skilled Dribbling
## 3 Power Free-Kick, Injury Free, Selfish, Early Crosser, Speed Dribbler (CPU AI Only), Crowd Favourite
## 4 Flair, Acrobatic Clearance
## 5 Beat Offside Trap, Selfish, Finesse Shot, Speed Dribbler (CPU AI Only), Crowd Favourite
## 6 Power Free-Kick, Avoids Using Weaker Foot, Dives Into Tackles (CPU AI Only), Leadership, Argues with Officials, Finesse Shot
## attacking_crossing attacking_finishing attacking_heading_accuracy
## 1 88 95 70
## 2 84 94 89
## 3 87 87 62
## 4 13 11 15
## 5 81 84 61
## 6 93 82 55
## attacking_short_passing attacking_volleys skill_dribbling skill_curve
## 1 92 88 97 93
## 2 83 87 89 81
## 3 87 87 96 88
## 4 43 13 12 13
## 5 89 83 95 83
## 6 92 82 86 85
## skill_fk_accuracy skill_long_passing skill_ball_control movement_acceleration
## 1 94 92 96 91
## 2 76 77 92 89
## 3 87 81 95 94
## 4 14 40 30 43
## 5 79 83 94 94
## 6 83 91 91 77
## movement_sprint_speed movement_agility movement_reactions movement_balance
## 1 84 93 95 95
## 2 91 87 96 71
## 3 89 96 92 84
## 4 60 67 88 49
## 5 88 95 90 94
## 6 76 78 91 76
## power_shot_power power_jumping power_stamina power_strength power_long_shots
## 1 86 68 75 68 94
## 2 95 95 85 78 93
## 3 80 61 81 49 84
## 4 59 78 41 78 12
## 5 82 56 84 63 80
## 6 91 63 89 74 90
## mentality_aggression mentality_interceptions mentality_positioning
## 1 48 40 94
## 2 63 29 95
## 3 51 36 87
## 4 34 19 11
## 5 54 41 87
## 6 76 61 88
## mentality_vision mentality_penalties mentality_composure defending_marking
## 1 94 75 96 33
## 2 82 85 95 28
## 3 90 90 94 27
## 4 65 11 68 27
## 5 89 88 91 34
## 6 94 79 91 68
## defending_standing_tackle defending_sliding_tackle goalkeeping_diving
## 1 37 26 6
## 2 32 24 7
## 3 26 29 9
## 4 12 18 87
## 5 27 22 11
## 6 58 51 15
## goalkeeping_handling goalkeeping_kicking goalkeeping_positioning
## 1 11 15 14
## 2 11 15 14
## 3 9 15 15
## 4 92 78 90
## 5 12 6 8
## 6 13 5 10
## goalkeeping_reflexes ls st rs lw lf cf rf rw lam cam ram
## 1 8 89+2 89+2 89+2 93+2 93+2 93+2 93+2 93+2 93+2 93+2 93+2
## 2 11 91+3 91+3 91+3 89+3 90+3 90+3 90+3 89+3 88+3 88+3 88+3
## 3 11 84+3 84+3 84+3 90+3 89+3 89+3 89+3 90+3 90+3 90+3 90+3
## 4 89
## 5 8 83+3 83+3 83+3 89+3 88+3 88+3 88+3 89+3 89+3 89+3 89+3
## 6 13 82+3 82+3 82+3 87+3 87+3 87+3 87+3 87+3 88+3 88+3 88+3
## lm lcm cm rcm rm lwb ldm cdm rdm rwb lb lcb cb rcb rb
## 1 92+2 87+2 87+2 87+2 92+2 68+2 66+2 66+2 66+2 68+2 63+2 52+2 52+2 52+2 63+2
## 2 88+3 81+3 81+3 81+3 88+3 65+3 61+3 61+3 61+3 65+3 61+3 53+3 53+3 53+3 61+3
## 3 89+3 82+3 82+3 82+3 89+3 66+3 61+3 61+3 61+3 66+3 61+3 46+3 46+3 46+3 61+3
## 4
## 5 89+3 83+3 83+3 83+3 89+3 66+3 63+3 63+3 63+3 66+3 61+3 49+3 49+3 49+3 61+3
## 6 88+3 87+3 87+3 87+3 88+3 77+3 77+3 77+3 77+3 77+3 73+3 66+3 66+3 66+3 73+3
It is also of value to import and load any R packages relevant to analyzing the data and building models. The following libraries will be used throughout this project.
#Load packages
library(ggplot2)
library(tidyverse)
library(tidymodels)
library(corrplot)
library(ggthemes)
library(corrr)
library(janitor)
library(Hmisc)
library(discrim)
library(pROC)
library(klaR)
library(rpart.plot)
library(randomForest)
library(ranger)
library(xgboost)
library(kknn)
library(dplyr)
As one can see, many of the columns in our dataset will simply not be useful for determining a player’s rating. For this reason, we will only select columns that relate to a player’s overall. There are 34 attribute columns in total that impact player rating, and these are selected below along with the overall column itself. Please note that each of these columns is numeric in nature, as it indicates a player’s rating in that specific attribute.
#Isolate relevant columns from original dataset
new_fifa_data = subset(fifa_data, select= c(overall, attacking_crossing, attacking_finishing, attacking_heading_accuracy, attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, skill_ball_control, movement_acceleration, movement_sprint_speed, movement_agility, movement_reactions, movement_balance, power_shot_power, power_jumping, power_stamina, power_strength, power_long_shots, mentality_aggression, mentality_interceptions, mentality_positioning, mentality_vision, mentality_penalties, mentality_composure, defending_marking, defending_standing_tackle, defending_sliding_tackle, goalkeeping_diving, goalkeeping_handling, goalkeeping_kicking, goalkeeping_positioning, goalkeeping_reflexes))
After selecting and isolating the appropriate data, we now want to check for any missing values. Rather than printing the full logical matrix that is.na() returns, we can simply count the missing entries. The important thing to note here is that we do not have any missing values in our dataset.
#Count missing values in the dataset
sum(is.na(new_fifa_data))
Finally, it is also worth considering whether we want to perform any feature scaling transformations on our data. For this reason, it is useful to create a histogram for every numerical column (in this case, every column).
#Create a histogram for every column
hist.data.frame(new_fifa_data, breaks=40)
As we can see above, each histogram has a value range of 0 to 100. This indicates that each attribute is on the same scale, and we consequently do not need to perform any feature scaling transformations. Now, we are ready to continue.
The next step is to split the overall data into training and testing data. Here, 80% of the data was split into training and 20% of the data was split into testing. Stratified sampling was utilized through stratifying on our outcome variable (overall).
Please note that the data split was conducted prior to any exploratory data analysis (EDA). This is because I do not want to risk the possibility of learning about the testing data before testing our model.
set.seed(3435)
fifa_split <- initial_split(new_fifa_data, prop = 0.80,
strata = overall)
fifa_train <- training(fifa_split)
fifa_test <- testing(fifa_split)
Now that the data has been split, we will find that the training data contains 14,621 observations and that the testing data contains 3,657 observations.
Please note that all of our EDA will be performed on the training data. This means that we will now be working with 14,621 observations rather than the original 18,278.
The first step to take is to examine correlations among the different predictors in the dataset. Given that our dataset has 34 predictors, displaying a correlation matrix in its totality would be visually unpleasant. For this reason, I have created a correlation matrix and identified all pairs of predictors with high levels of correlation (|r| >= 0.8). These correlations are what we shall now focus on.
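The filtering code itself is not shown here; a minimal base-R sketch of how such a cutoff could be applied might look like the following (the helper name is my own invention, and the 0.8 threshold matches the one described above):

```r
# Hypothetical helper: list every predictor pair whose absolute
# Pearson correlation meets or exceeds a cutoff.
find_high_corr <- function(df, cutoff = 0.8) {
  cor_mat <- cor(df)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(cor_mat) >= cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(var1 = rownames(cor_mat)[idx[, 1]],
             var2 = colnames(cor_mat)[idx[, 2]],
             r = cor_mat[idx])
}
# e.g. find_high_corr(fifa_train[, -1])  # dropping the outcome column first
```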
The correlation matrix indicates a strong correlation between the predictors defending_standing_tackle and defending_marking. Let’s examine this further:
#Create a scatterplot between defending_standing_tackle and defending_marking.
ggplot(fifa_train, aes(defending_standing_tackle, defending_marking)) +
geom_jitter(alpha = 0.1) +
geom_point(colour="red") +
geom_smooth(method=lm) +
xlab("Defending_Standing_Tackle") +
ylab("Defending_Marking")
The plot above indicates a strong positive correlation between defending_standing_tackle and defending_marking as a rating increase in one of these variables is associated with an increase in the other. While the correlation here may initially raise eyebrows, it is not nearly as surprising when considering the nature of the variables.
Defending_marking can best be described as a player’s ability to stay close to an opposing attacker and stop them from getting to a cross or pass from a teammate. Defending_standing_tackle measures a player’s ability to time a tackle while on their feet in order to win the ball without committing a foul. Both of these attributes are an incredibly important part of defense. Therefore, it is no surprise that a player who is primarily a defender will likely be skilled in both of these attributes. Conversely, a player who is primarily an attacker will leave a lot to be desired with respect to both.
A similar observation can be made with movement_sprint_speed and movement_acceleration:
#Create a scatterplot between movement_sprint_speed and movement_acceleration.
ggplot(fifa_train, aes(movement_sprint_speed, movement_acceleration)) +
geom_jitter(alpha = 0.1) +
geom_point(colour="green") +
geom_smooth(method=lm) +
xlab("Movement_Sprint_Speed") +
ylab("Movement_Acceleration")
The plot above indicates a strong positive correlation between movement_sprint_speed and movement_acceleration as an increase in one of these variables is associated with an increase in the other.
The variable movement_sprint_speed indicates how fast a player is relative to his peers. The variable movement_acceleration indicates how quickly a player can reach his highest velocity relative to his peers. It is no surprise that someone who is fast is also likely to have a high acceleration. Therefore, it makes sense that both of these variables exhibit strong positive correlation.
Both of the correlations displayed above can provide insight into the relevance of the predictors involved. Perhaps we do not need all four of these predictors and are better off dropping one from each correlated pair.
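If we did decide to drop them, tidymodels can automate this: the sketch below (an alternative recipe I did not end up using) uses recipes’ step_corr() to remove one predictor from every pair whose correlation exceeds the same 0.8 threshold.

```r
# Hypothetical alternative recipe: step_corr() drops the minimum set of
# predictors needed so that no remaining pair exceeds the threshold.
decorr_recipe <- recipe(overall ~ ., fifa_train) %>%
  step_corr(all_predictors(), threshold = 0.8) %>%
  step_normalize(all_predictors())
```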
The next correlation worth examining is between the variables of attacking_volleys and mentality_penalties. The relationship between these two variables is as such:
#Create a scatterplot between attacking_volleys and mentality_penalties.
ggplot(fifa_train, aes(attacking_volleys, mentality_penalties)) +
geom_jitter(alpha = 0.1) +
geom_point(colour= "yellow") +
geom_smooth(method=lm) +
xlab("Attacking_Volleys") +
ylab("Mentality_Penalties")
Looking at the plot above, we see a familiar story, as attacking_volleys and mentality_penalties share a positive correlation. However, the differentiating factor here is that these two variables are not naturally related to each other. Mentality_penalties can best be described as the confidence a player feels when taking a penalty kick. Attacking_volleys indicates a player’s ability to take a shot at goal when the ball is in the air. Neither one of these variables should, in theory, increase with the other. This correlation is something to consider when we get deeper into model building.
A similar observation can be made with skill_ball_control and attacking_crossing.
#Create a scatterplot between skill_ball_control and attacking_crossing.
ggplot(fifa_train, aes(skill_ball_control, attacking_crossing)) +
geom_jitter(alpha = 0.1) +
geom_point(colour= "orange") +
geom_smooth(method=lm) +
xlab("Skill_Ball_Control") +
ylab("Attacking_Crossing")
We again see a strong positive correlation with the two variables listed above. Just like the last example, both skill_ball_control and attacking_crossing are not naturally related. Skill_ball_control can be defined as a player’s ability to control the ball when dribbling. Attacking_crossing can be defined as a player’s ability to deliver a pass from a wide position to a central area for the purpose of scoring a goal or building an attack. Neither one of these variables are similar in nature but we still see a strong positive correlation. Such a correlation is also something to consider when we get deeper into model building.
It is also worth exploring whether any predictors are correlated to the outcome variable ‘overall’ itself. The only predictor which has a strong correlation with our outcome variable is movement_reactions. We can illustrate the relationship between both variables below:
#Create a scatterplot between movement_reactions and overall.
ggplot(fifa_train, aes(movement_reactions, overall)) +
geom_jitter(alpha = 0.1) +
geom_point(colour="brown") +
geom_smooth(method=lm) +
xlab("Movement_Reactions") +
ylab("Overall")
Here we see a positive correlation between both movement_reactions and overall.
Not only do we see a positive correlation but our correlation matrix also indicates that we see a strong correlation. As we can see below, the correlation between both these variables is approximately .86.
#Display the correlation between both overall and movement_reactions
res <- cor.test(fifa_train$overall, fifa_train$movement_reactions,
method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: fifa_train$overall and fifa_train$movement_reactions
## t = 207.29, df = 14619, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8596248 0.8678560
## sample estimates:
## cor
## 0.863798
The strong positive correlation seen above indicates that the predictor ‘movement_reactions’ has a linear relationship with the outcome variable ‘overall’.
Lastly, it is worth looking at the distribution of our outcome variable.
#Create histogram for outcome variable
fifa_train %>%
ggplot(aes(x = overall)) +
geom_bar()
Here, we see that the majority of overall ratings fall between 60 and 80, with the greatest concentration in the 60-70 range. This makes sense, as the average FIFA rating is generally believed to be in the low 60s. Overall, the variable appears to be approximately normally distributed, which is a good sign, as the data do not obviously violate the normality assumption.
Now, it is time to start building models! The first step is to create a recipe predicting the outcome variable, overall. For this recipe, we shall utilize all predictors in our dataset fifa_train.
fifa_recipe <- recipe(overall ~., fifa_train) %>%
step_normalize(all_predictors())
Due to the high number of predictors in the fifa_train dataset, I have been advised (by a TA) to utilize cross-validation in the model building process. For this reason, I will use k-fold cross-validation with \(k = 10\).
fifa_folds <- vfold_cv(fifa_train, v = 10)
fifa_folds
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [13158/1463]> Fold01
## 2 <split [13159/1462]> Fold02
## 3 <split [13159/1462]> Fold03
## 4 <split [13159/1462]> Fold04
## 5 <split [13159/1462]> Fold05
## 6 <split [13159/1462]> Fold06
## 7 <split [13159/1462]> Fold07
## 8 <split [13159/1462]> Fold08
## 9 <split [13159/1462]> Fold09
## 10 <split [13159/1462]> Fold10
The first type of model worth considering is Linear Regression. In order to examine this further, let us set up a workflow for a linear regression model with the lm engine:
fifa_linear_regression <- linear_reg() %>%
set_engine("lm")
linear_reg_wflow <- workflow() %>%
add_model(fifa_linear_regression) %>%
add_recipe(fifa_recipe)
Now, we will pass the necessary objects to tune_grid(), which will fit the models within each fold.
#Fit models to the folds created previously
tune_fifa <- tune_grid(
object = linear_reg_wflow,
resamples = fifa_folds
)
In order to evaluate our models, we need to select a performance measure. For this, I will be using the Root Mean Square Error (RMSE): the square root of the average squared difference between the predicted and actual values. This metric is the preferred performance measure for regression problems and will give an idea as to how much error a model makes in its predictions.
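To make the metric concrete, here is how the same number could be computed by hand (the function name is mine; yardstick’s rmse(), used below, computes this internally):

```r
# RMSE by hand: the square root of the mean squared residual.
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}
rmse_manual(c(70, 80, 90), c(72, 79, 87))  # errors of 2, 1, and 3 -> approx. 2.16
```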
After fitting the models to the folds, we can select the best-performing model (the one with the lowest RMSE) and fit this model to the training data.
best_linear_regression_model <- select_best(tune_fifa, metric = "rmse") #Select best model
final_linear_reg_wkflow <- finalize_workflow(linear_reg_wflow, best_linear_regression_model) #Finalize workflow
linear_reg_final_fit <- fit(final_linear_reg_wkflow, data = fifa_train) #Fit model
At last, we can determine the RMSE for the linear regression model.
#Determine the RMSE of the model
augment(linear_reg_final_fit, new_data = fifa_train) %>%
rmse(truth = overall, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.49
As we can see, the Linear Regression model produces an RMSE of approximately 2.49. While this score is not bad, it is worth evaluating the performance of other models.
#Regression Tree
The next model worth considering is a Regression Tree. This is a powerful model, capable of finding complex nonlinear relationships in the data. To begin this process, let us first create a general decision tree specification and a regression decision tree engine.
#Set specification and engine
tree_specification <- decision_tree() %>%
set_engine("rpart")
regression_spec <- tree_specification %>%
set_mode("regression")
Now we can fit the model.
regression_tree_fit <- fit(regression_spec,
overall ~ ., data = fifa_train)
Currently, we have the ability to visualize the regression tree model through the code below. However, this model will likely be overfit to the training data, consequently resulting in a failure to accurately make predictions on the testing data.
#Display current tree
regression_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
In order to prevent overfitting, we will apply a “pruning penalty” to the decision tree in the segment below. We will specifically tune the cost_complexity of the decision tree in order to find a more optimal complexity.
regression_tree_wflow <- workflow() %>%
add_model(regression_spec %>% set_args(cost_complexity = tune())) %>% #specify that we want to tune cost_complexity
add_formula(overall ~ .)
set.seed(3435)
new_fifa_fold <- vfold_cv(fifa_train) #Create a K-fold cross validation set
parameter_grid <- grid_regular(cost_complexity(range = c(-4, -1)), levels = 10) #Create a grid of values to try
new_tune_fifa <- tune_grid(
regression_tree_wflow,
resamples = new_fifa_fold,
grid = parameter_grid
)
Now we can create a visualization to compare which values of cost_complexity appear to produce the lowest RMSE.
autoplot(new_tune_fifa)
Luckily for us, we can automatically select the best performing value through the code below rather than eyeballing the graph above. We then finalize the workflow and fit the model on the full training data set.
fifa_best_complexity <- select_best(new_tune_fifa, metric = "rmse") #Select best parameter value
regression_tree_final <- finalize_workflow(regression_tree_wflow, fifa_best_complexity) #Finalize workflow
regression_tree_final_fit <- fit(regression_tree_final, data = fifa_train) #Fit model
We can now visualize the final version of our regression tree model.
regression_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()
Now, we can determine the performance on the training data.
#Determine the RMSE of the model
augment(regression_tree_final_fit, new_data = fifa_train) %>%
rmse(truth = overall, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.67
As we can see, the Regression Tree model produces an RMSE of approximately 1.67. This score is significantly better than the Linear Regression RMSE and is indicative of the fact that a Regression Tree model is likely superior to a Linear Regression model with respect to making predictions based off of our data.
Now, it is worth training a Random Forest model. This model is even more powerful than those tested before, as a Random Forest is a collection of decision trees whose results are aggregated into one final result. To begin this process, let us first create a random forest specification. Then, we can tune min_n, trees, and mtry, set the mode to “regression”, and use the ranger engine.
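As a toy illustration of the aggregation idea (a deliberate simplification, not ranger’s actual implementation), each tree sees a bootstrap resample of the training data, and the forest’s prediction is simply the average of the individual tree predictions:

```r
# Toy sketch: stand-in "trees" that each predict the mean of a
# bootstrap resample; the ensemble then averages their predictions.
set.seed(3435)
overall_sample <- c(65, 70, 72, 68, 90, 61)  # hypothetical overall ratings
tree_preds <- replicate(500, mean(sample(overall_sample, replace = TRUE)))
forest_pred <- mean(tree_preds)  # close to mean(overall_sample), i.e. 71
```

Averaging many high-variance trees is what gives the forest its stability relative to a single decision tree.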
forest_fifa <-
rand_forest(
min_n = tune(),
mtry = tune(),
trees = tune(),
mode = "regression") %>%
set_engine("ranger")
forest_workflow <- workflow() %>%
#Store the model and the recipe into the workflow
add_model(forest_fifa) %>%
add_recipe(fifa_recipe)
Next, I set up the tuning grid with 2 levels (this was coordinated with the TA). I selected 10 for the maximum of the mtry range, as it is roughly one-third of the 34 predictors in the data set.
fifa_params <- parameters(forest_fifa) %>%
update(mtry = mtry(range= c(2, 10)))
#Define grid
forest_grid <- grid_regular(fifa_params, levels = 2)
Now it is time to execute the model through tuning and fitting.
forest_tune <- forest_workflow %>%
tune_grid(
resamples = fifa_folds,
grid = forest_grid)
Let us now take a look at the rmse through the autoplot() function.
autoplot(forest_tune, metric = 'rmse')
Here, we can see that the RMSE decreases as the number of randomly selected predictors increases. This makes sense, as more player data means a higher probability of correctly predicting player overall.
We can then select the best-performing model and fit it to the training data.
best_random_forest_model <- select_best(forest_tune, metric = "rmse") #Select best parameter value
final_random_reg_wkflow <- finalize_workflow(forest_workflow, best_random_forest_model) #Finalize workflow
random_forest_final_fit <- fit(final_random_reg_wkflow, data = fifa_train) #Fit model
Let’s look at the RMSE for this model:
#Determine the RMSE of the model
augment(random_forest_final_fit, new_data = fifa_train) %>%
rmse(truth = overall, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.416
The random forest model has an RMSE of approximately 0.42. This score is significantly better than both the Linear Regression and Regression Tree models, indicating that the Random Forest model is currently the best model for making predictions on our data.
#Boosted Trees
The next model worth examining is a Boosted Trees model. Such a model is also significantly more powerful than the Linear Regression and Regression Tree models that we fit earlier. A Boosted Trees model is very similar to a Random Forests model as a collection of decision trees are utilized in both model types. The main difference, however, lies in how the decision trees are created and aggregated. Unlike random forests, the decision trees in Boosted Trees are not built independently. Instead, the trees are built additively (one after another).
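As a toy illustration of this additive principle (again a deliberate simplification; xgboost actually fits each tree to gradients of a loss function), each new learner is fit to the residuals left over by the ensemble so far:

```r
# Toy sketch: the "tree" at each boosting step is just the mean of the
# current residuals, added onto the running prediction.
y <- c(61, 70, 79)  # hypothetical overall ratings
pred <- rep(0, length(y))
for (step in 1:3) {
  residual <- y - pred            # what the ensemble still gets wrong
  pred <- pred + mean(residual)   # a real booster fits a tree here
}
pred  # all equal to mean(y) = 70; later steps add nothing further
```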
Let us first create a boosted tree specification. Then, we can tune min_n and mtry, set mode to “regression”, and use the xgboost engine.
boosted_fifa <- boost_tree(mode = "regression",
min_n = tune(),
mtry = tune()) %>%
set_engine("xgboost")
set_engine("xgboost")
boosted_workflow <- workflow() %>%
add_model(boosted_fifa) %>%
add_recipe(fifa_recipe)
Next, I once again set up a tuning grid with 2 levels. I also selected 10 for the maximum mtry again.
boosted_parameters <- parameters(boosted_fifa) %>%
update(mtry = mtry(range = c(2, 10)))
# define grid
boosted_grid <- grid_regular(boosted_parameters, levels = 2)
Now it’s time to execute the model through tuning and fitting.
boosted_tune <- boosted_workflow %>%
tune_grid(
resamples = fifa_folds,
grid = boosted_grid
)
Once again, we can take a look at the rmse through the autoplot() function.
autoplot(boosted_tune, metric = 'rmse')
We can then select the best-performing model and fit it to the training data.
best_boosted_tree_model <- select_best(boosted_tune, metric = "rmse") #Select best parameter
final_boosted_tree_wkflow <- finalize_workflow(boosted_workflow, best_boosted_tree_model) #Finalize workflow
boosted_tree_final_fit <- fit(final_boosted_tree_wkflow, data = fifa_train) #Fit model
Let’s look at the RMSE for this model:
augment(boosted_tree_final_fit, new_data = fifa_train) %>%
rmse(truth = overall, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.44
The boosted trees model has an RMSE of approximately 1.44. While this score is significantly better than the linear regression model and somewhat better than the regression tree model, it is significantly worse than the random forest model. So far, our random forest model seems to predict player overall best.
The final model worth considering is a K-Nearest Neighbors model. The K-Nearest Neighbors model is nonparametric in nature and significantly simpler than random forests or boosted trees. The model uses feature similarity in order to predict a given data point and is often overlooked in the context of regression.
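The idea can be sketched in a few lines of base R (a one-feature toy, not what kknn does internally; kknn also distance-weights its neighbors):

```r
# Toy sketch of KNN regression with a single numeric feature:
# predict the mean outcome of the k nearest training points.
knn_predict <- function(x_train, y_train, x_new, k = 3) {
  nearest <- order(abs(x_train - x_new))[1:k]
  mean(y_train[nearest])
}
knn_predict(x_train = c(60, 70, 80, 95),  # e.g. a skill attribute
            y_train = c(62, 71, 82, 93),  # corresponding overalls
            x_new = 75, k = 2)
```

With k = 2 and a new value of 75, the two closest training points (70 and 80) are selected and their overalls (71 and 82) are averaged, giving 76.5.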
We start off by creating a nearest neighbor specification and tuning neighbors(). We can also set mode to “regression”, and use the kknn engine.
knn_fifa_model <-
nearest_neighbor(
neighbors = tune(),
mode = "regression") %>%
set_engine("kknn")
knn_workflow <- workflow() %>%
add_model(knn_fifa_model) %>%
add_recipe(fifa_recipe)
Once again, we set up a tuning grid with 2 levels.
# set-up tuning grid
knn_parameters <- parameters(knn_fifa_model)
# define grid
knn_grid <- grid_regular(knn_parameters, levels = 2)
The next step is to execute the model through tuning and fitting.
knn_fifa_tune <- knn_workflow %>%
tune_grid(
resamples = fifa_folds,
grid = knn_grid)
As done previously, let us utilize the autoplot() function.
autoplot(knn_fifa_tune, metric = "rmse")
Then, we select the best-performing model and fit it to the training data.
best_knn_model <- select_best(knn_fifa_tune, metric = "rmse") #Select best parameter
final_knn_wkflow <- finalize_workflow(knn_workflow, best_knn_model) #Finalize workflow
knn_final_fit <- fit(final_knn_wkflow, data = fifa_train) #Fit model
We can now determine the RMSE.
augment(knn_final_fit, new_data = fifa_train) %>%
rmse(truth = overall, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.32
The k-nearest neighbors model produced an RMSE of 1.32. This score is slightly better than the boosted trees and regression tree models but still lags far behind the random forest model.
It is clear that the random forest model has the lowest RMSE and is therefore the best performing model. For this reason, we will continue with the random forest model.
Now that we have identified the model we would like to continue with, it is time to fit this model to the testing data. Given that the model is being evaluated on new data, we will get a more honest estimate of its real-world performance.
We’ll create a new workflow and finalize the workflow by taking the parameters from the best model (the random forest model) using select_best().
random_forest_workflow_final <- forest_workflow %>%
finalize_workflow(select_best(forest_tune, metric = "rmse"))
Then, we can run the fit.
random_forest_final_results <- fit(random_forest_workflow_final, fifa_train)
Now, we fit the model to the testing data set and determine the RMSE.
augment(random_forest_final_results, new_data = fifa_test) %>%
rmse(truth = overall, estimate = .pred)